Paranjpe etal. BMC Genomics 201 3, 14:762 
http://www.bionnedcentral.conn/1471-2164/14/762 



Genomics 



RESEARCH ARTICLE OpenAccess 



A genome-wide survey of maternal and 
embryonic transcripts during Xenopus 
tropica I is development 

Sarita S Paranjpe\ Ulrike G Jacobi^'^, Simon J van Heeringen^ and Gert Jan CVeenstra^"" 



Abstract 

Background: Dynamics of polyadenylation vs. deadenylation determine tlie fate of several developmentally 
regulated genes. Decay of a subset of maternal mRNAsand new transcription define the maternal-to-zygotic 
transition, but the full complement of polyadenylated and deadenylated coding and non-coding transcripts has not 
yet been assessed in Xenopus embryos. 

Results: To analyze the dynamics and diversity of coding and non-coding transcripts during development, both 
polyadenylated mRNA and ribosomal RNA-depleted total RNA were harvested across six developmental stages and 
subjected to high throughput sequencing. The maternally loaded transcriptome is highly diverse and consists of both 
polyadenylated and deadenylated transcripts. Many maternal genes show peak expression in the oocyte and include 
genes which are known to be the key regulators of events like oocyte maturation and fertilization. Of all the transcripts 
that increase in abundance between early blastula and larval stages, about 30% of the embryonic genes are induced 
by fourfold or more by the late blastula stage and another 35% by late gastrulation. Using a gene model validation 
and discovery pipeline, we identified novel transcripts and putative long non-coding RNAs (IncRNA). These IncRNA 
transcripts were stringently selected as spliced transcripts generated from independent promoters, with limited 
coding potential and a codon bias characteristic of noncoding sequences. Many IncRNAs are conserved and 
expressed in a developmental stage-specific fashion. 

Conclusions: These data reveal dynamics of transcriptome polyadenylation and abundance and provides a 
high-confidence catalogue of novel and long non-coding RNAs. 

Keywords: Xenopus tropicalis, RNA-seq, Maternal and embryonic transcriptome, Polyadenylation, Deadenylation, 
MBT, Codon bias, Long-noncoding RNAs 



Background 

Innovations in sequencing technology have allowed deep 
sequencing of complementary DNA (cDNA), known as 
ribonucleic acid sequencing (RNA-seq), enabling tran- 
scriptome assembly and identification of coding and non- 
coding transcripts across many cell types [1-4]. 

Transcriptome profiling studies have been undertaken 
in zebrafish, using polyadenylated (polyA"^) selected 
messenger RNA (mRNA). These studies have reported 
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identification of thousands of maternal genes and identi- 
fied the earliest set of embryonic transcripts. They also 
identified a large number of novel transcribed regions 
in annotated and unannotated regions of the zebrafish 
genome [5,6]. In Xenopus, several deep-sequencing stud- 
ies have created different libraries of small RNAs from 
oocytes, eggs, gastrula, liver and skin [7-9]. A gastrula 
stage polyadenylated (polyA"^) selected RNA-seq pro- 
file was used to identify transcribed loci, to enhance 
gene annotation and to analyze spatial regulation of gene 
expression [10]. Recently, similar polyA"^ libraries of mul- 
tiple stages of development were published [11]. For the 
analysis of transcriptome dynamics it is important to 
appreciate that, like many other vertebrates, the Xenopus 
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maternal-to-zygotic transition involves two important 
processes: first, destabilization of a subset of maternal 
mRNAs; second, onset of transcription at the mid-blastula 
transition (MBT) [12-14]. Studies in Xenopus laevis iden- 
tified distinct phases of maternal, late embryonic and 
larval gene expression during the course of embryoge- 
nesis, whereas microarray analysis in Xenopus tropicalis 
identified several developmentally important maternal 
mRNAs that are regulated by changes in their adeny- 
lation during oogenesis and early development [15,16]. 
Cytoplasmic polyadenylation is essential for the meiotic 
maturation of the oocyte as it mediates translational 
activation of mRNAs encoding mos kinase and mitotic 
cyclins involved in early rapid synchronous cell divisions 
[17-20]. Several maternally polyadenylated mRNAs lose 
their polyadenylated tails after fertilization. In most cases, 
this is mediated by an embryonic deadenylation element 
(EDEN) in the 3'untranslated region (UTR) of the mRNA, 
which binds embryonic deadenylation element - binding 
protein (EDEN-BP) [21]. Processes that regulate mRNA 
deadenylation and degradation are temporally uncoupled. 
Deadenylated RNAs are as stable as their polyadenylated 
counterparts until the blastula stage, several hours after 
fertiUzation [22]. 

For developmental analysis it is important to establish 
the dynamics and scale of maternal transcript destabi- 
lization on a genome-wide level and to identify the full 
complement of embryonic transcripts, including as of yet 
unannotated and long non-coding RNAs (IncRNAs), the 
analysis of which will be facilitated by transcriptome pro- 
filing using polyA^ or total RNA-sequencing. Here, we 
present results from polyA^ and ribosomal RNA depleted 
total-RNA (RiboZero, RZ) sequencing. Our study distin- 
guishes changes in polyadenylation and abundance, which 
is critical for the the analysis of early transcript dynam- 
ics and proper identification of maternal and embryoni- 
cally induced transcripts. The embryonic genome shows 
a gradual cascade of activation, which involves only a 
third of the number of genes expressed in the oocyte. 
By expanding and updating our previously published 
Xenopus tropicalis experimentally validated (Xtev) anno- 
tation pipeline, we also identified 2,135 new transcripts 
resulting in a total collection of 29,663 gene models. 
These new transcripts do not overlap with gene mod- 
els in the Xenopus model organism database (Xenbase) 
[23]. Using stringent filtering criteria and manual cura- 
tion, 31 transcripts were identified as "stand-alone" 
IncRNA transcripts. We characterize these transcripts 
in terms of exon number, transcript length, conserva- 
tion and expression pattern during embryogenesis and 
thus anticipate that our catalogue of coding and long 
non-coding transcripts will enable more developmental 
and genomic studies directed towards dissecting their 
functional roles. 



Results 

Deep sequencing of PolyA^ and total RNA libraries 

To systematically analyze the transcriptome during early 
development, we performed polyA"^ and total RNA 
(RiboZero, RZ) sequencing experiments across Xenopus 
tropicalis embryogenesis based on three biological repli- 
cates (Figure la): (1) Oocytes (PA, RZ); (2) early blas- 
tula (stage 6; PA, RZ); (3) late blastula (stage 9, after 
MBT; PA, RZ); (4) gastrula (stage 12; PA); (5) neu- 
rula (stage 16; PA) and (6) an early larval stage (stage 
30; PA). We verified abundance of transcripts in bio- 
logical replicates with real time RT- PGR (RT-qPCR) 
using random hexamers. We not only find minimum 
variance in transcript abundance among replicates, but 
also the stage-dependent expression dynamics are sim- 
ilar for the repUcates (Additional file 1: Figure SI a). 
Total RNA was independently extracted for each bio- 
logical replicate, subjected to polyA"^ or ribosomal RNA 
depleted total RNA (RZ) enrichment protocols, quality- 
controlled, pooled and converted into complementary 
DNA sequencing libraries for the lUumina Genome 
Analyzer platform (Additional file 1: Table SI, see 
Materials and methods). 

Gene expression was calculated as reads per kilobase 
of exon model per million mapped reads (RPKM, see 
Materials and methods) and shows a comparable median 
distribution across sequencing libraries (Figure lb). 
Heatmap representation of the Pearson correlation coef- 
ficients reveal the similarity within the early (oocyte, 
stage 6, stage 9) and the late (stage 12, stage 16, stage 
30) transcriptomes respectively (Figure Ic). There are 
major changes in the PA transcriptome marking mei- 
otic maturation and fertiUzation (oocyte, stage 6) and the 
maternal-to-zygotic transition (stages 6, 9, 12; Figure Ic). 
The total RNA (RZ) profiles of the early developmen- 
tal stages correlate relatively well with each other, espe- 
cially between oocyte and stage 6, most likely due to the 
presence of stable maternal RNAs and the early embryo 
being transcriptionally quiescent. Gorrelation between 
the stages is higher for total RNA than the polyA"^ data, 
most likely due to changes in polyadenylation of mater- 
nal mRNA. To rule out any bias in correlation arising 
from low expression values, we filtered the data for a 
threshold of 1 RPKM in oocyte (PA and RZ data). The 
Pearson correlation heatmap of the filtered data shows 
a similar profile (Figure Ic, Additional file 1: Figure 
Sib). The correlation between same stages in differ- 
ent data sets (PA and RZ), while moderate, is highly 
significant (p<10~^^), reflecting the representation of 
most transcripts in both types of libraries (Figure Id, 
Additional file 1: Table S2). A Spearman's rank order 
correlation analysis strongly underscores the similarities 
in the total RNA and polyA+ data (Additional file 1: 
Figure Sic). The data are also in good agreement with 
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Figure 1 Generation of RNA-sequencing libraries, (a) Developmental stages of Xenopus tropicolis. (b) RPKM distribution across six 
developmental time-points. Numbers on the x-axis are Xenopus tropicalis Nieuwkoop and Faber developmental stages, Oocyte (Oo), stage 6, stage 9, 
stage 1 2, stage 1 6 and stage 30. (c) Heat map to show Pearson correlation of expression (RPKM) between all 9 RNA-seq libraries, (d) Scatter plots to 
show stage specific Pearson correlation between RNA-seq data generated using two different methods. Log2 RPKM values are plotted on x and y 
axis respectively. PolyA+ (RNA harvested with double PolyA+ selection), RZ (ribosomal rRNA depleted-total RNA). 
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previously published ribonucleic acid sequencing (RNA- 
seq) and microarray data ([11,24], Additional file 1: 
Figure S2a, b). 

Abundance and polyadenylation state of maternal and 
maternal-embryonic transcripts 

To investigate the maternal contribution to the transcrip- 
tome of the developing embryo, we used polyA^ and 
total RNA-seq data to classify maternal and embryonic 
transcripts. A set of 9,513 transcripts were called as mater- 
nally expressed as they met a filtering criteria of RPKM 
greater than or equal to 1 in oocyte (PA or RZ data) 
(Additional file 2: Page - Maternal genes). This set of 
transcripts includes maternal-embryonic genes, which are 
transcribed in both oocytes and embryos. Our mater- 
nal subset also includes well known mouse maternal 
genes like c-mos, zp2y zp3 and SLBP (for more exam- 
ples see Additional file 2: Page - Maternal genes) [25,26]. 
The polyA^ and total RNA-seq pool of transcripts are 
likely to have different complexity, and while their RPKM 
expression levels cannot be compared directly, the ratio 
of these two measures (PA/RZ) may reflect the relative 
polyadenylation state. This would allow us to examine the 
polyadenylated state of all transcripts during early devel- 
opment. Using this strategy we initially compared the 
PA/RZ RPKM ratios for genes like cdk2(Egl), kifll(Eg5) 
and hesZl, which are known to be deadenylated after 
fertilization [15,22], and indeed, their PA/RZ ratios faith- 
fully reflect their fertilization-induced deadenylated state 
(Figure 2a, Additional file 1: Figure S3a). Genome-wide 
average PA/RZ ratios show an overall abundance of 
polyadenylated maternal gene products at stage 9 rela- 
tive to earlier stages (Additional file 1: Figure S3b). These 
abundant polyA^ transcripts arise either from polyadeny- 
lation of the maternally derived message, or new tran- 
scription (maternal-embryonic transcripts). We validated 
the adenylation states of transcripts using RT-qPCR in 
combination with oligo(dT)2o primers and random hex- 
amers (Additional file 1: Figure S3c). With one exception 
{sox3)y polyA^ ribonucleic acid sequencing (RNA-seq) 
data and RT-qPCR with oligo(dT)2o primers correlated 
very well. To analyze the deadenylated transcripts more 
systematically, we filtered the log ratio of two measures 
(PA/RZ) to be less than or equal to -0.5 for oocyte, 
stage 6 and stage 9 respectively. This filtering gave us 
sets of genes that are highly enriched in RZ data com- 
pared to the polyA"^ data (oocyte : 2,675, stage 6 : 5,118, 
stage 9 : 2,016 genes, see Additional file 2: Page - RZ- 
enriched genes). Interestingly we find stage 6 to be highly 
enriched in deadenylated transcripts. This may reflect the 
fertilization-induced deadenylated state, which has been 
reported in the literature [27-30]. In order to gain an 
insight in to the functional categories of deadenylated 
transcripts at this stage, we compared enrichment of gene 



ontology (GO) terms of biological processes (BP) using 
DAVID [31]. We extracted the stage-specific functional 
annotation charts from DAVID and compared them using 
clusterProfiler, an R package for comparing gene clusters, 
with a p-value cut off of 0.01 [32]. Interestingly most of 
the stage 6 transcripts show an enrichment of biologi- 
cal processes related to cell-cycle control and regulation 
(Additional file 1: Figure S3d). For example important cell- 
cycle regulators like aurkuy cdc25c and bircS.l belong to 
stage 6 RZ-enriched fraction. Several chromatin reorga- 
nizers and modifiers like ezh2y cbx4 and hdacS are also 
enriched in the stage 6 RZ- enriched fraction. 

To assess patterns of genome-wide polyadenylation dur- 
ing early development in more detail, we used K-means 
clustering (Figure 2b). Clusters 1, 3, and 4 are groups 
of maternally abundant polyadenylated transcripts and 
include well known genes like ccnbl, aurkb, mos, emil 
and maskin. These genes show peak expression in the 
oocyte and are deadenylated or degraded in a stage- 
specific manner (Additional file 1: Figures S3a and GO 
term enrichment for cluster 3 - Figure S4a). Cluster 5 rep- 
resents 12% of all the maternally loaded transcripts which 
are relatively deadenylated. This cluster includes histone 
variants like histlh2ad, histlh2aU which are known to 
exist as deadenylated transcripts [33,34], emi2 which is 
well studied for its role in unfertilized eggs where it, 
along with its partner mos causes arrest at metaphase 
of meiosis II (see GO term enrichment for cluster 5 in 
Additional file 1: Figure S4a) [35,36]. Cluster 6 transcripts 
are polyadenylated during early development. A notable 
gene in this cluster is celfl, which codes for embry- 
onic deadenylation element - binding protein (EDEN-BP), 
known to mediate sequence-specific mRNA deadenyla- 
tion [37-39]. Cluster 7 represents a group of genes that 
seem to be loaded as relatively deadenylated messages in 
the oocyte and are then polyadenylated post-fertilization 
or post-MBT (see GO term enrichment for cluster 7 
in Additional file 1: Figure S4a). Overall 59% (clusters 
1,2,3,4) of transcripts are deadenylated during oocyte mat- 
uration and early post-fertilization development, whereas 
57% (clusters 1,3,5,6) show a higher relative polyadeny- 
lation state in late blastulae compared to early blastu- 
lae. Motif analysis of 3' ends of transcripts clustered in 
Figure 2b show a significant enrichment of deadenylation 
and polyadenylation elements (ARE, EDEN and eCPE) in 
several clusters (Additional file 1: Figure S4b). 

To gain insight into the fate and temporal expression 
patterns of maternally-abundant polyadenylated tran- 
scripts after the blastula stage, we compared the polyA^ 
data from six stages (oocyte, stage 6, stage 9, stage 
12, stage 16 and stage 30) using K-means clustering 
(Figure 2c). Cluster 2 includes aurkb, a mitotic serine- 
threonine kinase, which declines in abundance post-MBT. 
We find that genes like tcf^ and oct91 from cluster 3 
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Figure 2 Total and Polyadenylated RNA profiles of the Maternal Transcriptome. (a) Barplots to show gene specific distribution of log2 RPKM 
ratios during early development, (b) Heatmap to show stage specific comparison between PolyA+ and RZ data. The barplots to the right of the 
figure represent average PolyA+ and RZ ratios per stage for the same cluster numbered to the left of the heatmap. Gene names are representative 
examples from the corresponding cluster, (c) Heatmap to show abundance of polyadenylated maternal genes from six developmental time points. 
Gene names are representative examples from the corresponding cluster. The heatmaps (b and c) show scaled expression values (the sum of 
expression per gene across all stages is set to one). PolyA+(RNA harvested with double PolyA+ selection), RZ (ribosomal rRNA depleted-total RNA). 
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have different profiles of abundance during develop- 
ment. oct91, a homologue of the mammalian pluripo- 
tency factor oct3/4y peaks in abundance at late gastrula 
(stage 12) and declines drastically thereafter (Additional 
file 1: Figure S5f) [40]. On the other hand tcfSy a gene 
encoding a helix-loop-helix transcription factor responsi- 
ble for mesoderm and axis formation as well as anterior 
forebrain development via repression of wnt/beta-catenin 
targets, dramatically peaks at blastula and then exists as 
a stable polyadenylated transcript up to organogenesis 
(Additional file 1: Figure S5b) [41-44]. This analy- 
sis shows that the abundance of many maternally 
loaded polyadenylated transcripts declines after late 
blastula. 

Progressive activation of the embryonic genome 

An embryonic set of 2,481 transcripts was obtained by 
filtering the data for a 10-fold increase in any of the 
stages compared to oocyte expression. This set includes 
transcripts which are transcribed around or immedi- 
ately after the mid-blastula transition (MBT). Concomi- 
tant polyadenylation could stabilize the newly synthesized 
transcripts and mark them for translation. To exam- 
ine whether this is the case we compared the relative 
adenylation status of embryonic transcripts expressed at 
stage 9 to that of maternally abundant stage 9 tran- 
scripts. This comparison revealed that a large number 
of stage 9 embryonic transcripts are highly polyadeny- 
lated compared to their maternal counterparts, although 
the polyA"^ state is also more variable in the embry- 
onic subset of the late blastula transcriptome than in the 
maternal-embryonic subset (Figure 3a). The polyadenyla- 
tion distribution of this variable embryonic transcriptome 
at stage 9 is robust and does not change with filtering 
criteria of embryonic transcripts (fold increase relative 
to oocyte levels) (Figure 3a). To explore this variable 
transcriptomic landscape during embryogenesis, we used 
K-means clustering to group genes according to their 
expression (log-transformed RPKM values) or according 
to their expression relative to maximum gene expres- 
sion (scaled expression per gene) (Figures 3b and d). This 
revealed that many genes are activated at the MBT, but 
most are more strongly induced at later stages (Figure 3b). 
Comparing the numbers of expressed transcripts relative 
to the total embryonic pool of transcripts reveals that 
about 30% of the embryonic genes are induced by fourfold 
or more during blastula stages and another 36% by late 
gastrulation (Figure 3c). Clustering gene expression rela- 
tive to maximum gene expression reveals exquisitely well- 
defined clusters of dynamic gene expression (Figure 3d). 
Genes which peak in the late blastula (Figure 3d, cluster 1) 
include nodal 3,2/5/6 and sial/2, respectively signalling 
and transcription factors important for the Spemann- 
Mangold organizer (Additional file 1: Figure S5d and g) 



[45,46]. Gastrula expression represented by cluster 5 con- 
tains 8% of genes peaking in expression (Figure 3e). This 
cluster is dominated by high expression of genes involved 
in mesendoderm specification and patterning such as 
eomesy chrd, gsc, and t (Additional file 1: Figure S5g). 
Cluster 3 and 4 shows genes peaking during neurulation 
and include genes like neurog, pax6, nkx2-S, wnt4 and 
myod. Cluster 2 and cluster 6 together represent clusters 
of genes peaking in expression at stage 30 of development 
and include transcription factors involved in hematopoi- 
etic development like gatal and tall, celfB is an embryonic 
gene which belongs to the CELF family of genes that code 
for RNA binding proteins involved in deadenylation of 
mRNAs. It is known to be exclusively expressed in the 
nervous system including domains in the brain, spinal 
cord, optic and otic vesicles [47] . irxS, a homeobox tran- 
scription factor known for its role in neural patterning 
also peaks during organogenesis [48]. As the embryonic 
genome is progressively taking control of development, 
we found 50% of the developmentally induced genes reach 
maximum expression late in development (Figure 3e). 

To correlate these temporal profiles with gene function, 
enrichment for GO terms of BP were examined using 
DAVID [31] . Using gene names as unique identifiers, we 
found a significant enrichment of stage-specific processes 
(Figure 4a). We then extracted the stage-specific func- 
tional annotation charts from DAVID and compared them 
using clusterProfiler, an R package for comparing gene 
clusters, with a p-value cut off of 0.01 (Figure 4b) [32]. 
Cluster 1 (from Figure 3d) is a small cluster of 31 genes 
and shows enrichment of terms like nucleosome assem- 
bly, a term with 3 genes {hist2h2ab, histlh4k, histlj2aj) 
(blue arrowhead. Figure 4b). The stage 12-specific genes 
(cluster 5, from Figure 3d) show enrichment for the term 
gastrulation and include well-known genes (for example 
bmp4, cerl,fgf8y gsCyfoxa2, nodal, eomes, lefl, Ihxlyfoxcly 
foxc2, chrd, gata4) (red arrowhead. Figure 4b). In con- 
clusion, apart from confirming genes enriched in terms 
with known functions, our GO analysis also provides a 
framework of hypothesis for several genes with unknown 
function. 

Experimental validation of gene models and analysis of 
novel transcripts 

To improve gene annotation and identify potentially 
novel transcripts, we updated our previously published 
Xenopus tropicalis experimentally validated (Xtev) anno- 
tation pipeline [10]. Using more sequencing data and 
the latest genome build, we performed guided tran- 
script assembly with Cufflinks using all our polyA"^ and 
total RNA-seq data with JGI 7.1 annotation as refer- 
ence [49,50] and combined the Cufflinks transcripts with 
expressed sequence tags (EST) clusters (Gurdon clusters, 
courtesy of Dr. Mike Gilchrist). Both histone H3 lysine 
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Figure 3 Overview of the Embryonic Transcriptome. (a) Density plot to show distribution of Maternal-Embryonic (grey) and Embryonic (red) 
ratios of polyA+ vs. RZ expression (RPKM) at Stage 9. (b) Heatmap to show dynamic expression of 2,481 polyadenylated embryonic genes. Scale 
represents the log2 transformed RPKM values. Gene names are representative examples from the corresponding cluster, (c) A pie-chart to show 
percentage of genes whose expression is increased four folds or more relative to Oocyte, (d) A heatmap to show scaled expression (the sum of 
expression per gene across all stages is set to one) of 2,481 polyadenylated embryonic genes. Gene names are representative examples from the 
corresponding cluster, (e) A pie-chart to show percentage of embryonic genes peaking in expression per stage. 
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exocrine system development (7) - 



pvalue 
0.002 
0.004 
0.006 
0.008 



Figure 4 Gene Ontology Analysis of the Embryonic Transcriptome. (a) GO term enrichment analysis from DAVID. Barplots (i) Stage 1 2, (ii) Stage 
1 6, (ill) Stage 30 show stage-specific significant Biological Processes and their -log P-values plotted on x-axis. (b) A plot to cluster and visualize 
DAVID-derived GO terms from developmental stages 9, 1 2, 1 6 and 30 using R package clusterProfiler with a p-value cut off < 0.01 [32]. The DAVID 
GO terms have been derived from biological process annotation of Xenopus tropicolis genes. 
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4 tri-methylation (H3K4me3) and RNA polymerase II 
(RNAPII) chromatin immuno-precipitation sequencing 
(ChlP-seq) data were used to validate or update the 5' 
ends of the gene models as described previously [10] (see 
Additional file 1: Figure S6, Additional file 3: Page - Gene 
models). The annotation pipeline resulted in a collection 
of 29,663 Xenopus tropicalis spliced gene models out of 
which 18,305 were validated or updated by the Xenopus 
tropicalis experimentally validated (Xtev) pipeline. From 
these validated models, 17,592 (96%) can be detected 
by RNA-seq and 65% have H3K4me3 enrichment at the 
annotated start site. Several thousand gene models were 
updated and/or reannotated leading to addition of 5^ 3^ 
or internal exons (for a complete overview of Xtev(v3.4) 
known gene model update see Additional file 1: Figures S6 
and S7a). In addition 2,135 spliced transcripts were newly 
identified on basis of RNA-seq and/or EST evidence. 

As a by-product of our gene annotation pipeline, we 
find evidence for a total of 33,601 single exon unspliced 
gene models. These unspliced single exon gene models 
are filtered out early on in the pipeline and have not been 
analyzed further (for a complete list with genomic co- 
ordinates see Additional file 3: Page - Single exon gene 
models ). These single exon models include MALATl 
(metastasis associated lung adenocarcinoma transcript 1), 
a known single exon IncRNA conserved in mammals, 
zebrafish and Xenopus [51]. From the expression data it 
appears to be most abundant at the at neurula stage, sug- 
gestive of a specialized stage-dependent regulatory role 
(Additional file 1: Figure S7b). 

Compared to our pubUshed annotation pipeUne, where 
we only validated and updated known gene models, which 
are mostly protein-coding, the new implementation is 
more inclusive in annotating both coding and long non- 
coding RNAs. To identify new gene models we looked for 
Cufflinks transcripts lacking any overlap with gene mod- 
els from the Xenopus model organism database (Xenbase) 
[23]. We found 2,135 gene models to be new and non- 
overlapping (new gene models (NGM), Figure 5a, for a 
complete list with genomic co-ordinates see Additional 
file 3: Page -NGM). Out of this set, 594 gene models are 
supported by both expression data (RNA-seq, EST) and 
a S' H3K4me3 modification peak (new gene models with 
validation support (NGM-vv), see Additional file 1: Figure 
S6, for a complete list with genomic co-ordinates co- 
ordinates see Additional file 3: Page - NGM-vv), meaning 
that they are likely stand-alone transcripts. To validate 
some of these transcripts, we looked at their relative 
abundance by performing RT-qPCR of RNA from several 
developmental stages (stage 9, stage 10, stage 10.5 and 
stage 12). We find stage-specific expression (Additional 
file 1: Figure S8a). 

To assess the coding potential of the various gene 
model subsets (Xtev gene models (GM), NGM, NGM-w) 



we used open reading frame (ORE) length and 
the log likelihood ratio (LLR) of codon bias (see 
Materials and methods). We find that 65% of new tran- 
scripts that are not overlapping with known genes (NGM 
subset), show a codon bias score comparable to non- 
coding genomic sequences (Figure 5b). Also, 70% of these 
NGM transcripts have a maximum ORE length com- 
parable to non-coding genomic sequences (Figure 5c). 
By contrast, transcripts from annotated protein-coding 
genes {Xenopus tropicalis mRNA) and all GM have large 
ORE lengths and higher codon bias score, the cumulative 
distribution of which is well-separated from that of non- 
coding genomic DNA. However it should be noted that 
the subset of new non-overlapping transcripts that has 
strong experimental validation support (594, NGM-w) 
may contain both coding and non-coding transcripts. 60% 
have an ORE length of less than 100 amino acids and also 
the codon bias distribution suggests this subset repre- 
sents a mixed population of coding and non-coding RNAs 
(Figure 5b and c). To enrich for long non-coding RNAs 
(IncRNAs), we curated the NGM-vv subset both bioinfor- 
matically and manually. Since high-confidence IncRNAs 
have been shown to lack an ORE length greater than 100 
amino acids [4], we used this as a first step to enrich for 
putative IncRNAs. Inspection of this subset of gene mod- 
els revealed that it was enriched for the 5' untranslated 
region (UTR) exons of downstream gene models with- 
out H3K4me3 peak and a number of other artifacts. We 
therefore excluded new gene models that were upstream 
of the known genes without a H3K4me3 peak, or did not 
meet more stringent expression and splicing criteria (see 
Materials and methods). The resulting 98 gene mod- 
els were screened using BLASTN and BLASTP against 
homology to known protein coding sequences and an 
array of other problems on the genome browser. For 
example models in regions with many gaps were excluded. 
Also, several intronic transcripts were selected against as 
they may not represent independent transcription units. 
This screening resulted in a set of 37 transcripts. We com- 
pared our different subsets of transcripts with the new 
transcripts identified by Tan et al, [11]. The Tan study 
reports the identification of 13,836 novel transcripts, a 
set of transcripts which collapses to a set of 3,726 unique 
models that map to the JGI7.1 genome assembly and are 
not included in the JGI7.1 annotation. Only 122 of these 
models overlap with the NGM set of 2,135 transcripts. 
All of the 37 transcripts we selected were identified 
in the Tan study, however six of these were Unked to 
protein-coding genes in their multi-stage ribonucleic acid 
sequencing (RNA-seq) data sets. Therefore, our stringent 
filtering and curation approach has generally enriched 
for transcripts identified as new gene models in both 
studies and we conclude that the remaining 31 transcript 
models are likely stand-alone transcription units. This 
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Figure 5 Analysis of Novel transcripts, (a) Subsets of gene models from the updated Xenopus tropicolis gene annotation pipeline, (b and c) 
Cumulative frequency chart to show distribution of codon bias (LLR score) and ORF length between new gene models (NGM), all gene models 
(GM), new gene models with validation support (NGM-vv), random genomic sequences (Genomic seq.) and Xenbase extracted X.tropicolis mRNAs 
{X.trop mRNA). 



subset is referred to as NGM-vvo: manually curated new 
gene models with H3K4me3 peaks and RNA-seq evi- 
dence, and a longest ORF length of 100 amino acids. In 
terms of assessing coding metrics, this set has a codon 
bias score comparable to random genomic sequences 
and their cumulative distribution is well separated from 
known protein-coding mRNA (Figure 6a). An example 
is shown in Figure 6b (two more examples shown in 
Additional file 1: Figure S9a). Compared to the protein- 
coding transcripts, the manually curated new gene 
models with ORF less than 100 amino acids (NGM-wo) 
transcripts have low exon number and relatively shorter 
transcripts (Figures 6c and d), similar to what has been 
reported for IncRNAs [52,53]. Based on our manual cura- 
tion, coding potential metric and gene-structure analysis, 
we conclude that NGM-vvo subset represents a set of 
high-confidence IncRNAs. 

Mammalian and zebrafish IncRNA show a mean expres- 
sion varying from a 3-fold to a 10-fold difference from 



their protein-coding counterparts [53,54]. To investigate 
the expression of NGM-wo IncRNAs during embryo- 
genesis, we looked at their polyA^ mRNA profiles dur- 
ing embryogenesis. We find that the median expression 
level of IncRNAs during embryogenesis is one-third of 
their protein-coding counterparts (Figures 7a and b and 
Additional file 1: Figure S8b). To investigate expression 
patterns of NGM-vvo transcripts, we performed unsu- 
pervised hierarchical clustering of polyA^ and total-RNA 
expression profiles during embryogenesis. Just like the 
protein-coding genes, some NGM-vvo transcripts have 
maternal expression. Many are developmentally regulated 
and show a stage-specific peak in expression (Figure 7c 
and Additional file 1: Figures S8b). 

The GENCODE v7 catalogue of human IncRNAs has 
looked into conservation of human IncRNAs using phast- 
Cons analysis [52]. Our IncRNA conservation results 
are in line with these analyses, since we find NGM- 
vvo exons to be less conserved than annotated protein- 
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Exon Number Transcript length (in nt) 

Figure 6 Analysis of NGM-vvo transcripts, (a) Cumulative frequency chart to show distribution of codon bias (LLR score) for NGM-vvo, random 
genomic sequences (Genomic seq.) and Xenbase extracted X.tropicolis mRNAs {X.trop mRNA). (b) An example to illustrate NGM-vvo gene model. 
H3K4me3 peak demonstrates the gene being transcribed from its own promoter [1 0]. (c and d) Frequency distribution to compare number of 
exons and transcript length (nt, nucleotides) between all gene models (GM) and new gene models (NGM-vvo). 



coding mRNA, but more conserved than the random 
genomic sequences (Figure 7d). The evolutionary con- 
straints on their sequence and their developmentally 
regulated expression may be an indication of their 
stage-specific functionality. 

Discussion 

Our results present temporal profiles of maternal and 
embryonic transcripts during early development. We 
report a total number of 14,819 non-redundant Xenbase 



transcripts expressed in any of the six assayed stages 
from oocyte to tailbud embryos and mapped to Xenopus 
tropicalis genome assembly (Joint Genome Institute 
(JGI) 7.1). In our data set the maternal transcriptome 
consists of over 9,000 transcripts that show differen- 
tial adenylation and of these 46% are abundant in the 
oocyte polyA^data (Figure 2c). This is interesting in view 
of the fact that the oocyte serves as a reservoir of sta- 
ble maternal transcripts which drive early development 
in the absence of embryonic transcription. To better 
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Figure 7 Expression analysis of putative long non-coding RNAs (NGM-vvo). (a) Boxplot to show log transformed expression (RPKM, PolyA^) 
across six developmental stages in the NGM-vvo subset, (b) Density graph to compare stage-9 expression (RPKM, PolyA+) between all gene models 
(GM) and NGM-vvo subset, (c) Heat map to show unsupervised hierarchical clustering of expression (RPKM) of polyA^ and RZ data across 
embryogenesis. Colorscale represents deviation from mean expression calculated row-wise, (d) Density plot to compare distribution of logio 
transformed conservation score (phastCons analysis, see Materials and methods) between random genomic sequences (Genomic Seq), Xenbase 
extracted X.tropicalis mRNAs {X.trop mRNA) and NGM-vvo subset. PolyA+(RNA harvested with double polyA+ selection), RZ (ribosomal rRNA 
depleted-total RNA). 



understand the dynamics of polyadenylated vs. dead- 
enylated mRNA, we compared the ratio of polyA^ and 
total RNA. This comparison gave us a tool to examine 
the dynamics of transcript abundance and polyadenyla- 
tion during early development. We observed fertilization- 
induced deadenylation of several cell cycle regulators like 
cdk2(Egl), kifll(Eg5) as already reported [16]. Also, it 
is interesting to note that there is an exclusive pool of 
relatively deadenylated transcripts, which in our analy- 
sis accounts for 12% of the maternal genes and includes 
well known non-adenylated transcripts like the histone 
mRNAs (Figure 2b). Transcription has been reported to 



start at the mid-blastula stage [12,13,55], although a num- 
ber of genes are transcribed before this stage. We find little 
evidence of pre-MBT transcription at stage 6 in our data. 
Many maternal transcripts are gaining polyadenylation 
during post-fertilization development and may appear as 
false-positives in an analysis of early transcription if only 
polyA^ messages are considered. Between stage 6 and 
stage 9, some genes are activated early as described for 
several nodal genes [56]. The embryonic transcript abun- 
dance is stage-specific. About 30% of the embryonic genes 
are induced four-fold or more by late blastula and another 
35% by late gastrulation (Figure 3c). 50% of the genes 
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peak late in development (Figure 3e). Our GO analy- 
sis of embryonic genes provides confirmation of genes 
with known functions as well as provides a framework 
for hypothesis for several genes with unknown functions 
(Figure 4b). 

We have generated an updated annotation pipeline for 
Xenopus tropicalis experimentally validated (Xtev) gene 
models, featuring a total of 31,157 transcripts. Of these, 
we find 2,135 gene models to be new, however many 
of these may be linked to known transcripts and are 
not independently generated. The NGM-vvo subset of 31 
transcripts is a high-confidence set of IncRNAs, which 
shares many of the characteristic features of IncRNAs 
such as low exon number, relatively short length and 
overall low expression during embryogenesis [4,51,53]. 
They are decorated with H3K4me3 histone modifica- 
tion at the 5' end, evidence that these high-confidence 
IncRNAs are transcribed from their own promoter. Like 
their protein-coding counterparts, the expression profile 
of high confidence IncRNAs (NGM-vvo) is stage-specific 
and temporally restricted. 

It proved surprisingly difficult to identify this high- 
confidence set of IncRNAs. This is because any selection 
for novel or unannotated transcripts enriches result- 
ing subsets for annotation problems (broken genes) and 
assembly problems (poorly assembled regions with frag- 
mented genes). Our high-confidence approach may how- 
ever underestimate the true number of IncRNAs that are 
expressed during embryogenesis. First of all, IncRNA may 
be transcribed from complex loci and not all may meet our 
criteria of stand-alone transcripts. Second, many IncRNAs 
are expressed at very low levels. Inclusion of more RNA- 
seq data is therefore likely to identify more IncRNAs. 
On the other hand, true stand-alone IncRNA transcrip- 
tion units, produced from their own promoter, may be far 
less common than frequently assumed, and the majority 
of "new transcripts" may arise as by-products of known 
genes or be transcribed from highly complex loci. Also, 
RNA-seq alignment tools produce artifacts and multiple, 
sometimes incorrect, models for the same locus. There- 
fore, approaches that integrate expression and histone 
modification data are essential to curate transcription 
units. 

Functional analyses of the novel IncRNAs are required 
to elucidate their potential roles in pre-MBT transcrip- 
tional repression, gastrulation, neurulation and organo- 
genesis. Our catalogue of high confidence stand-alone 
IncRNAs with sequence conservation and stage-specific 
expression provides a prioritized resource for studies in 
IncRNA function during vertebrate development. 

Conclusions 

We provide a comprehensive survey of the Xenopus trop- 
icalis transcriptome using polyA"^ and ribosomal-RNA 



depleted total RNA expression data. These results provide 
insights into the maternal and embryonic components 
of expression and polyadenylation dynamics through-out 
early embryogenesis. In addition, our improved annota- 
tion has led to the discovery of new transcripts which 
constitute subset of high-confidence stand-alone IncR- 
NAs. Together, these data provide a rich developmentally 
relevant resource, integration of which will enable new 
genomic and genetic studies in the near future. 

Materials and methods 

Animal procedures 

Xtropicalis embryos were obtained with in vitro fertil- 
ization from three separate crosses (different outbred 
animals). Briefly, both females and males were primed 
with 10 units of human chorionic gonadotropin (hCG- 
pregnyl, Organon). Four to six hours before embryo col- 
lection, female frogs were injected with 200 units of hCG. 
Forty-five minutes after the onset of egg laying, embryos 
were collected and dejellied in 3% cysteine hydrochloride 
(pH 8.0) in 10% MMR. Embryos were then cultured in 10% 
MMR at room temperature and were staged according to 
Nieuwkoop and Faber (1994) [57]. Embryos from three 
separate clutches were harvested and frozen at -80°C until 
RNA isolation. Stage VI oocytes were harvested by treat- 
ing ovarian follicles with coUagenase [Clostridium type I 
coUagenase, Sigma). 

RNA preparation and sequencing 

Oocyte and embryos from Nieuwkoop-Faber stages 6-30 
were collected and total RNA was isolated using Trizol 
and the QIAGEN RNeasy Kit. Subsequently, polyadeny- 
lated RNA was selected by enriching with the Oligotex 
mRNA kit (QIAGEN). To ensure complete removal of 
ribosomal RNA (rRNA), polyadenylated mRNA was sub- 
jected to an additional round of Oligotex treatment. 
Total RNA was subjected to depletion of ribosomal RNA 
(rRNA) using Ribozero Epicenter low input kit. Two 
important quality control measures were taken to confirm 
removal of ribosomal RNA (rRNA). First, the ribosomal 
RNA (rRNA) -depleted sample was tested on a RNA-chip 
(Experion, BIORAD) in comparison with non-depleted 
total RNA. Absence of 28S and 18S peaks in the ribosomal 
RNA (rRNA) depleted sample confirmed good deple- 
tion. Second, RT-qPCR with primers against 28S, 5S and 
GAPDH was performed. 28S RNA levels were less than 
5%, typically around 1% after depletion, whereas GAPDH 
was typically at more than 80% of the levels before deple- 
tion. For sequencing, cDNA was prepared for both polyA^ 
and RZ samples with random hexamer primers using 
Superscript III (Invitrogen) and the second strand was 
made with DNA polymerase I, DNA ligase and T4 DNA 
polymerase. The purified double-stranded cDNA was 
used for lUumina sample preparation. All quality control 
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qPCR reactions were performed on a MylQ single-color 
reader real-time PGR detection system (BioRad) using iQ 
SYBR Green Supermix (BioRad). 

The three biological replicates were checked by RT- 
qPGR and pooled for sample preparation and sequencing. 
These samples were then processed according to the man- 
ufacturer s protocol (lUumina). Shortly, adapter sequences 
were linked to the complementary DNA (cDNA) samples, 
the library was size selected (300-350bp), and ampli- 
fied by polymerase chain reaction (PGR). The subse- 
quent sequencing was carried out on Genome Analyzer 
(lUumina). 

RNA-seq expression analysis 

On average, we obtained about 16-50 million reads per 
stage (Additional file 1: Table SI). Out of the total reads 
about 50-60% could be aligned to the genome assem- 
bly (JGI 7.1) of the Xenopus tropicalis genome sequence. 
To allow a quantitative comparison all reads were nor- 
malized before analysis. The transcript list contains all 
the genes that are expressed (= non zero RPKM) in at 
least one stage. The RPKM per gene is the mean of all 
RPKM of all the non-redundant exons of all isoforms per 
gene. The total list contains around 15,289 genes of which 
only 470 are not detected as expressed in any stage. All 
unknown/unnamed gene names have been changed to 
include the genomic position for reasons of identifica- 
tion. Alignment was performed using Burrows-Wheeler 
Aligner (BWA), reads mapping to multiple positions (non- 
unique) were not included in the RPKM calculation [58]. 

Xtev (v3.4) gene annotation pipeline 

Gene models (JGI 7.2) were downloaded from Xenbase 
(http://www.xenbase.org) and EST clusters mapped to 
the JGI 7.1 X.tropicalis genome were supplied by Mike 
Gilchrist (NIMR). Spliced transcription units were gener- 
ated from the RNA-seq data. All reads were mapped to 
JGI 7.1 using TopHat v2.0.4 [50]. The TopHat output was 
filtered to keep only new splice sites with evidence of at 
least 5 spliced reads. The filtered TopHat output was used 
with Cufflinks vl.3.0 to perform transcript assembly [49]. 
The experimental annotation pipeline consists of several 
steps: 1) collect gene models; 2) update with experimental 
data; 3) validate and/or update gene models with RNA- 
seq data; 4) validate and/or update transcription start-site 
(TSS) with H3K4me3 and/or RNAPII ChlP-seq data and 
5) Choose the most likely model (Additional file 1: Figure 
S6). All Xenbase gene models sharing at least 1 exon 
were considered as multiple models of a single gene. The 
EST clusters and transcripts determined by Cufflinks were 
used to update the gene models with extra putative exons, 
mainly at the 5' and 3' end of genes. The number of RNA- 
seq reads was determined for all exons of all models. If 1/3 
of the exons of a model contained at least 3 RNA-seq reads 



the model was considered as expressed. If the first exon 
of a gene model overlapped with a H3K4me3 peak, the 
TSS was considered as validated. If no single model of a 
gene had a validated TSS, we looked for evidence of a TSS 
upstream. In this case there had to be a H3K4me3 peak 
upstream of a gene model, with no different gene mod- 
els in between, and the mean RNAPII level of the region 
between the upstream H3K4me3 peak and the 5^ exon of 
the gene had to be at least 0.5 of the mean RNAPII level of 
the gene body. Furthermore, all gene models were checked 
for evidence of a downstream H3K4me3 peak, which can 
indicate a putative alternative TSS. For each single gene 
the most likely model was chosen according to the follow- 
ing criteria, in order of decreasing importance: a validated 
TSS, number of expressed exons, number of exons. Of the 
new models, which are not present in the Xenbase JGI 7.2 
annotation, only spliced transcripts were included. 

Analysis of coding potential 

Coding potential of RNA sequences was determined 
using maximal ORE length and codon bias metrics, 
as described [59]. The codon bias metric is based on 
unequal usage of synonymous codons. Briefly, triplet 
frequencies were determined in non-coding genomic 
X.tropicalis DNA (JGI4.2, GL172663:l-4,425,020, UCSC 
table browser basepair-wise intersection of comple- 
mented Human Proteins with UCSC xenTro3 assembly), 
whereas Xtropicalis codon frequencies were downloaded 
from http://www.kazusa.or.jp/codon/cgi-bin/showcodon. 
cgi?species=8364. Log likelihood ratios (LLRs) for each 
codon were calculated based on the codon frequency con- 
ditional of the encoded amino acid, such that for each 
codon i coding for amino acid a/, LLR/ = log2(c//n/) , in 
which Ci and n/ correspond to the likelihood of codon i 
conditional on amino acid a/ in coding and non-coding 
sequences respectively (Additional file 4). The total LLR 
score is determined by summing LLR^ values in all 90 bp 
windows in six potential reading frames. After computing 
a score for windows, the max LLR score was defined as the 
maximum score observed in all windows of the transcript. 

Quantitative RT-qPCR for known and new gene model 
(NGM subset) validation 

Validation of known and novel transcripts was performed 
on total-RNA which was subjected to depletion of riboso- 
mal RNA (rRNA) using Ribozero Epicenter low input kit. 
Total RNA was then DNase treated and column purified 
to remove any contaminating genomic DNA. cDNA was 
prepared with oligo(dT)2o primers or random hexamers 
using Superscript III (Invitrogen). qPCR reactions were 
performed on a MylQ single-color reader real-time PGR 
detection system (BioRad) using iQ SYBR Green Super- 
mix (BioRad). With Stage 9 as reference, fold change was 
calculated by normalizing Ct values in Stages 10, 10.5 
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and 12 against odd gene using the 2 ^^^^ method (for 
primer sequences see Additional file 5) [60]. 

Bioinformatic and manual curation of NGM-vv subset 

NGM-w subset is collection of 594 new gene models. As 
a first step, we filtered these models for ORF length less 
than or equal to 100 amino acids. This resulted in a set 
of 331 gene models, which were then screened using fol- 
lowing criteria: 1) Absence of a downstream gene (same 
orientation) with a X or U annotation for H3K4me3 (see 
flowchart for Xtev pipeline in Additional file 1: Figure S6); 
2) RPKM of all exons should be greater than or equal 
1 (to filter out models where our data does not support 
the model) and 3) Evidence of splicing in our data. This 
resulted in a set of 98 gene models. These models were 
then manually curated using BLASTN and BLASTP to fil- 
ter against homology to known protein coding sequences 
(as described in the main text). 

Conservation (phastCons) analysis 

For conservation analysis all gene models were mapped to 
JGI v4.1 (UCSC: xenTro2) using blat. The average phast- 
Cons score per gene model was calculated using the Con- 
servation track (phastCons7way) of the UCSC genome 
browser. 

Data Availability 

The data have been deposited in NCBI's Gene Expres- 
sion Omnibus [61] and are accessible through GEO 
Series accession number GSE43652. Xtev gene models are 
available at: http://veenstra.ncmls.nl/genomedata.asp. 

Statement on animal use 

Animal care and use was in accordance with national 
and European guidelines and standard operating proce- 
dures approved by the institutional animal care and use 
committee (Dierexperimentencommissie, DEC). 
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Additional file 1 : Supplementary tables (SI and S2) and figures 
(S1-S9). 

Additional file 2: Multi-page list of genes expressed in Xenopus 
tropicalis. 

Additional file 3: Multi-page Xtev annotation pipeline gene list with 
genomic co-ordinates. 

Additional file 4: Single-page list of codon usage frequencies used 
for calculating codon bias score. 

Additional file 5: Single-page list of primers used for real-time qPCR 
validation of highly conserved IncRNAs from the NGM subset. 
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acids; LLR: Log likelihood ratio; JGI: Joint Genome Institute; rRNA: Ribosomal 
RNA; BWA: Burrows-Wheeler Aligner; PGR: Polymerase chain reaction; 
H3K4me3: Histone H3 lysine 4 tri-methylation; RZ: Ribosomal RNA depleted 
total RNA; RNAPII: RNA polymerase II; ChlP-seq: Chromatin 
immuno-precipitation sequencing; TSS: Transcription start-site. 
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